Putting Things in Order: On the Fundamental Role of Ranking in Classification and Probability Estimation
Abstract
While a binary classifier aims to distinguish positives from negatives, a ranker orders instances from high to low expectation that the instance is positive. Most classification models in machine learning output some score of ‘positiveness’, and hence can be used as rankers. Conversely, any ranker can be turned into a classifier if we have some instance-independent means of splitting the ranking into positive and negative segments. This could be a fixed score threshold, a point obtained by fixing the slope on the ROC curve, or the break-even point between true positive and true negative rates, to mention just a few possibilities. These connections between ranking and classification notwithstanding, there are considerable differences as well. Classification performance on n examples is measured by accuracy, an O(n) operation; ranking performance, on the other hand, is measured by the area under the ROC curve (AUC), an O(n log n) operation. The model with the highest AUC does not necessarily dominate all other models, and thus another model may achieve a higher accuracy for certain operating conditions, even if its AUC is lower. However, within certain model classes good ranking performance and good classification performance are more closely related than the previous remarks suggest. For instance, there is evidence that certain classification models, while designed to optimise accuracy, in effect optimise an AUC-based loss function [1]. It has also been known for some time that decision trees yield convex training set ROC curves by construction [2], and thus optimising training set accuracy is likely to lead to good training set AUC.
In this talk I will investigate the relation between ranking and classification more closely. I will also consider the connection between ranking and probability estimation. The quality of probability estimates can be measured by, for example, the mean squared error of the probability estimates (the Brier score). However, like accuracy, this is an O(n) operation that doesn’t fully take ranking performance into account. I will show how a novel decomposition of the Brier score into calibration loss and refinement loss [3] sheds light on both ranking and probability estimation performance. While previous decompositions are approximate [4], our decomposition is an exact one based on the ROC convex hull. (The connection between the ROC convex hull and calibration was independently noted by [5].) In the case of decision trees, the analysis explains the empirical evidence that probability estimation trees produce well-calibrated probabilities [6].
Invited speakers at ECML/PKDD are supported by the PASCAL European network of
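The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the method of the talk: the function names and the toy data are hypothetical, and the Brier score is decomposed by grouping instances that share the same score (as the leaves of a decision tree do), which is an assumption standing in for the ROC-convex-hull construction the abstract refers to. Under that grouping, calibration loss plus refinement loss adds up to the Brier score.

```python
from collections import defaultdict


def accuracy(scores, labels, threshold=0.5):
    """Accuracy of the classifier 'positive if score >= threshold'; one O(n) pass."""
    correct = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return correct / len(labels)


def auc(scores, labels):
    """Fraction of positive-negative pairs ranked correctly (ties count 1/2).
    The sort dominates the cost, hence O(n log n)."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    wins, neg_seen, i = 0.0, 0, 0
    while i < len(pairs):
        # Find the run of instances tied on the same score.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        tie_pos = sum(y for _, y in pairs[i:j])
        tie_neg = (j - i) - tie_pos
        # Each positive beats all lower-scored negatives and half of the tied ones.
        wins += tie_pos * (neg_seen + 0.5 * tie_neg)
        neg_seen += tie_neg
        i = j
    return wins / (n_pos * n_neg)


def brier_decomposition(scores, labels):
    """Return (Brier score, calibration loss, refinement loss).
    Assumption: segments are groups of instances with identical scores."""
    n = len(labels)
    brier = sum((s - y) ** 2 for s, y in zip(scores, labels)) / n
    groups = defaultdict(list)
    for s, y in zip(scores, labels):
        groups[s].append(y)
    calibration = refinement = 0.0
    for s, ys in groups.items():
        p = sum(ys) / len(ys)                   # empirical positive rate in the segment
        calibration += len(ys) * (s - p) ** 2   # score vs. empirical rate
        refinement += len(ys) * p * (1 - p)     # impurity of the segment
    return brier, calibration / n, refinement / n


# Hypothetical toy data: three score groups, labels 1 = positive, 0 = negative.
scores = [0.9, 0.9, 0.6, 0.6, 0.6, 0.2, 0.2, 0.2]
labels = [1, 1, 1, 0, 1, 0, 0, 1]

print("accuracy:", accuracy(scores, labels))
print("AUC:", auc(scores, labels))
print("Brier, calibration, refinement:", brier_decomposition(scores, labels))
```

In this sketch the refinement term depends only on how the instances are grouped, so it reflects the quality of the ranking, while the calibration term vanishes when each score equals the positive rate of its group.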
Similar resources
Identifying and Ranking Internet of Things Services in Healthcare
Introduction: The Internet of Things is a system of connected physical objects that are accessible through the internet. It has been widely applied to connect available medical resources and provide reliable, effective and smart healthcare services to people. Therefore, the aim of this paper was to identify and rank the internet of things in healthcare services. Methods: In this applied resear...
Ranking the Trading Symbols of the Largest Companies Listed in the Tehran Stock Exchange Based on the Probability of Informed Trade Criteria
In this paper, the trading symbols of the 30 largest companies listed on the Tehran Stock Exchange (TSE) were ranked based on asymmetric information risk. Using the Ersan and Alici (2016) modified clustering algorithm (EA), we estimated the probability of informed trading (PIN) to measure the information asymmetry among traders for each trading symbol and trading day through two-year...
Discharge Estimation by using Tsallis Entropy Concept
Flow-rate measurement in rivers under different conditions is required for river management purposes including water resources planning, pollution prevention, and flood control. This study proposed a new discharge estimation method that uses a mean velocity derived from a 2D velocity distribution formula based on the Tsallis entropy concept. This procedure is done based on several factors which refl...
Estimating the Default Probability of Bank Loans Using Logit Regression
The existence of risk in banking operations can threaten the profitability of banks. Crises observed in the banking system were mainly due to inefficient credit risk management. The most important instrument that banks need to adopt for monitoring and managing credit risk is a customer ranking system. Accordingly, the main objective of the present research is to estimate a Logit model...
Investigate Factors Affecting on the Performance of Agricultural Machinery Companies Based on Taxonomy Algorithm
Taxonomy (general) is the practice and science of classifying things or concepts, including the principles that underlie such classification. Economic taxonomy is a system of classification for economic activity. The main objective of the study was to find whether financial ratios affect the performance of the Agricultural Machinery companies in Iran. A firm performance evaluation and its comp...
Publication date: 2007